Abstract. Graphs and networks are used to model interactions in a variety of contexts. There is
a growing need to quickly assess the characteristics of a graph in order to understand its underlying
structure. Some of the most useful metrics are triangle-based and give a measure of the connectedness
of mutual friends. This is often summarized in terms of clustering coefficients, which measure
the likelihood that two neighbors of a node are themselves connected. Computing these measures 
exactly for large-scale networks is prohibitively expensive in both memory and time. However, a 
recent wedge sampling algorithm has proved successful in efficiently and accurately estimating clustering
coefficients. In this paper, we describe how to implement this approach in MapReduce to
deal with extremely massive graphs. We show results on publicly available networks, the largest of
which has 132M nodes and 4.7B edges, as well as artificially generated networks (using the Graph500
benchmark), the largest of which has 240M nodes and 8.5B edges. We can estimate the clustering
coefficient by degree bin (e.g., we use exponential binning) and the number of triangles per bin, as 
well as the global clustering coefficient and total number of triangles, in an average of 0.33 sec. per 
million edges plus overhead (approximately 225 sec. total for our configuration). The technique can 
also be used to study triangle statistics such as the ratio of the highest and lowest degree, and we 
highlight differences between social and non-social networks. To the best of our knowledge, these are 
the largest triangle-based graph computations published to date. 

Keywords: triangle counting, clustering coefficient, triangle characteristics, large- 
scale networks, MapReduce 

1. Introduction. Over the last decade, graphs have emerged as the standard 
for modeling interactions between entities in a wide variety of applications. Graphs 
are used to model infrastructure networks, the world wide web, computer traffic, 
molecular interactions, ecological systems, epidemics, co-authors, citations, and social 
interactions, among others. Understanding the frequency of small subgraphs has been 
an important aspect of graph analysis. 

Despite the differences in the motivating applications, some topological structures 
have emerged to be important across all these domains. The most important such 
subgraph is the triangle (3-clique). Many networks, especially social networks, are 
known to have many triangles. This is thought to be because social interactions exhibit 
homophily (people befriend similar people) and transitivity (friends of friends become 
friends). The notion of clustering coefficient is inspired by this observation, and is the
standard method of summarizing triangle counts [50, 34]. It is well known that some 
networks, especially social networks, have much higher clustering coefficients than 
random networks [32, 33, 35]. Triangle measures are important for understanding 
network structure and evolution [21, 39, 18, 49] and reproducing the degree-wise 
clustering coefficients of a network is important for generative models [19, 39, 41]. 

1.1. Our Contributions. For large graphs, computing triangle-based measures 
can be extremely expensive. The standard approach is to find all wedges, i.e., paths 
of length 2, and check to see if they are closed, i.e., the edge that completes the
triangle exists. Previous work presents a wedge-sampling approach for approximating
clustering coefficients [40, 43]; this is in contrast to sampling single edges, which is a
more obvious but less reliable technique. In [43], it is shown that the wedge-sampling 
approach is orders of magnitude faster than enumeration and is both faster and has 
less variance than edge-sampling techniques. 

In this paper, we show that the wedge-sampling approach scales to very large 
networks. Specifically, we make the following contributions: 

• We present a parallelization of our wedge-based sampling algorithm in the 
MapReduce framework. The basic premise of wedge sampling is to set up a dis- 
tribution on the vertices (as potential wedge centers) and use that to sample the 
actual wedges. Designing a serial algorithm is fairly straightforward, since the infor- 
mation to compute the distributions and then actually form the wedges is all local. 
In the MapReduce implementation, the edges are distributed arbitrarily; therefore, it 
takes several passes through the data to compute the necessary distribution, actually 
create the sample wedges, and finally check if they are closed. 

• An advantage of the MapReduce approach is that we can compute multiple 
clustering coefficients for the same graph (e.g., binned by degree) for essentially the
same cost as computing the single global clustering coefficient. Since the clustering
coefficient generally varies with degree, it is helpful to see the profile versus a single 
value. Such profiles are useful in graph characterization and modeling. 

• We give extensive numerical results, on both real-world networks from the Lab- 
oratory for Web Algorithms as well as artificial large-scale networks created according 
to the Graph500 benchmark. We have multiple examples with more than a billion
edges. To the best of our knowledge, these are the largest triangle computations to 
date. 

• These results demonstrate the efficiency of our algorithm, and the costs are quite
reasonable. For instance, we estimate the cost of computing clustering coefficients 
per (logarithmically) binned degree to be an average of 0.33 sec. per million edges 
plus overhead, which is approximately 225 sec. total for our 32-node Hadoop cluster. 
Hence, a graph with over 9B edges requires less than one hour of computation. Note 
that global clustering coefficient and total triangles are also computed. 

• A straightforward implementation requires that the entire edge list be "shuf- 
fled" three times. We show how to greatly reduce the shuffle volume with some 
clever implementation strategies that are able to filter the edge list during the "map" 
phase. We discuss the implementation details and show numerical comparisons of 
performance. 

• A feature of wedge-based sampling is that the closed wedges are uniform random 
triangles. Hence, we also give experimental results characterizing triangles in terms 
of their minimum and maximum degrees. Triangles from social networks tend to
be somewhat assortative whereas triangles from other types of networks are not. 

1.2. Related Work: Other Work on Triangle Counting. The enumera- 
tion algorithms for finding triangles are either node- or edge-centric. Node-centric 
algorithms iterate over all nodes and, for each node $v$, check all pairs among the
neighbors of $v$ for being connected. Edge-centric algorithms, on the other hand, go
over all edges $(u, v)$ and seek common neighbors of $u$ and $v$. Chiba and Nishizeki [11]
proposed a node-centric algorithm that orders the vertices by degree and processes
each edge only once, by its lower-degree vertex. They showed that this algorithm runs
in $O(m\,\alpha(G))$ time, where $m$ is the number of edges and $\alpha(G)$ is the arboricity of the
graph $G$ (arboricity is defined as the minimum number of forests into which its edges
can be partitioned and can be considered as a measure of how dense the graph is).
Schank and Wagner [40] used the same idea for their forward algorithm. Cohen [14]
and Suri and Vassilvitskii [44] independently proposed MapReduce algorithms based
on the same idea. Chu and Cheng studied an I/O efficient implementation of the
same algorithm [13]. Latapy proved that the forward algorithm runs in $O(m^{3/2})$ time
and proposed improvements that reduce the search space [26]. Latapy also showed
that the runtime of this algorithm becomes $O(m n^{1/\alpha})$ for graphs with power-law
degree distributions, where $\alpha$ is the power-law coefficient and $n$ is the number of
vertices [26]. Arifuzzaman et al. [2] give a massively parallel algorithm for computing
clustering coefficients. 

Enumeration algorithms, however, can be very expensive due to the extremely
large number of triangles (see, e.g., Tab. 5.1), even for graphs of moderate size
(millions of vertices). Eigenvalue/trace-based methods adopted by Tsourakakis [47]
and Avron [3] compute estimates of the total and per-degree number of triangles. 
However, the compute-intensive nature of eigenvalue computations (even just a few 
of the largest magnitude) makes these methods intractable on large graphs. 

Most relevant to our work are sampling mechanisms. Tsourakakis et al. [45] ini- 
tiated the sparsification methods, the most important of which is Doulion [48]. This 
method sparsifies the graph by retaining each edge with probability $p$; counts the triangles
in the sparsified graph; and multiplies this count by $1/p^3$ to predict the number
of triangles in the original graph. Theoretical analyses of this algorithm (and its vari- 
ants) have been the subject of various studies [24, 46, 36]. One of the main benefits 
of Doulion is its ability to reduce large graphs to smaller ones that can be loaded into 
memory. However, the estimates can suffer from high variance [52]. Alternative sam- 
pling mechanisms have been proposed for streaming and semi-streaming algorithms 
[4, 22, 5, 10]. 

Earlier work by a subset of the current authors shows that the wedge-sampling 
approach featured here provides the same accuracy and speed advantages of other 
sampling-based methods (like Doulion) but has a hard bound on the variance [43]. 
Moreover, the wedge-based approach is much more flexible and can compute a variety
of triangle-based metrics including degree-wise clustering coefficients and uniform 
randomly sampled triangles. 

1.3. Related Work: MapReduce for Graph Analytics. MapReduce [15] 
is a conceptual programming model for processing massive data sets. The most pop- 
ular implementation is the open-source Apache Hadoop [1] along with the Apache 
Hadoop Distributed File System (HDFS) [1], which we have used in our experiments. 
MapReduce assumes that the data is distributed across storage in roughly equal-sized 
blocks. The MapReduce paradigm divides a parallel program into two parts: a map 
step and a reduce step. During the map step, each block of data is assigned to a 
mapper which processes the data block to emit key-value pairs. The mappers run in 
parallel and are ideally local to the block of data being processed, minimizing com- 
munication overhead. In between the map and reduce steps, a parallel shuffle takes 
place in order to group all values for each key together. This step is hidden from the 
user and is extremely efficient. For every key, its values are grouped together and sent 
to a reducer, which processes the values for a single key and writes the result to file. 
All keys are processed in parallel. 

MapReduce has been used for network and graph analysis in a variety of contexts. 
It is a natural choice for processing, if for no other reason than the fact that it is widely 
deployed [29]. Pegasus [23] is a general library for large-scale graph processing; the
largest graph they considered had 1.4M vertices and 6.6M edges, analyzed with the PageRank
analytic, but they did not report execution times. Lin and Schatz [31] propose some
special techniques for graph algorithms such as PageRank that have matrix-vector 
products. MapReduce sampling-based techniques that reduce the overall graph size 
are discussed by Lattanzi et al. [27]. 

In terms of triangle counting and computing clustering coefficients, Cohen [14]
considers several different analytics including triangle and rectangle enumeration. 
Plantenga [38] has studied subgraph isomorphism (i.e., finding small graph patterns 
such as triangles), including Cohen's algorithm as a special case. (We use Plantenga's 
implementation of Cohen's Triangle enumeration algorithm for comparison in our sub- 
sequent numerical results.) For a non-triangle pattern, Plantenga's SGI code ran on a 
7.6B vertex graph with 107B undirected edges in 620m on a 64-node Hadoop cluster. 
Wu et al. [51] have also studied triangle enumeration using MapReduce with running 
times of roughly 175 seconds on a graph with 1.6M nodes and 5.7M edges. Suri and 
Vassilvitskii [44] proposed a MapReduce implementation for exact per-node cluster- 
ing coefficients. Most naive partitioning schemes do not give efficient parallelization 
because of high-degree vertices, and their result involves new partitioning methods to 
avoid this problem. They run on the same Twitter data set that we use in §5 and 
report a computation time of 483m. SAHAD [54] has a Hadoop program that uses 
sampling techniques based on graph coloring to find subgraphs, but is limited to tree 
patterns. 

2. Background. 

2.1. Global Clustering Coefficient. Let $G = (V, E)$ be an undirected graph
with $n = |V|$ nodes and $m = |E|$ undirected edges. Without loss of generality, we
assume the vertices are indexed by $i = 1, \dots, n$. Let $d_i$ denote the degree of vertex $i$.
A wedge is a length-2 path. Let $p_i$ denote the number of wedges centered at vertex $i$.
It can easily be shown that $p_i$ is given by

$$p_i = \binom{d_i}{2} = \frac{d_i (d_i - 1)}{2}.$$

A wedge is closed if its endpoints are connected and open otherwise. The center of
a wedge is the middle vertex. A triangle is a cycle with three vertices. A closed
wedge forms a triangle; conversely, a triangle corresponds to three closed wedges. Let
$t_i$ denote the number of triangles containing node $i$, which is equal to the number of
closed wedges centered at node $i$. The node-level clustering coefficient (first used in
[50]) is

$$c_i = \frac{t_i}{p_i} = \frac{\text{number of triangles incident to node } i}{\text{number of wedges centered at node } i}.$$

Thus, $c_i$ measures how tightly the neighbors of a vertex are connected amongst themselves.

We define $W$ to be the set of all wedges in $G$ and $p$ to be the total number of
wedges, i.e., $p = |W| = \sum_i p_i$. We partition $W$ into two disjoint subsets as follows:

$$W_0 = \{\, w \in W \mid w \text{ open} \,\}, \qquad W_3 = \{\, w \in W \mid w \text{ closed} \,\}.$$
$n = 6$, $m = 7$, $\{d_i\} = \{2, 2, 3, 4, 2, 1\}$

$p = 12$, $\{p_i\} = \{1, 1, 3, 6, 1, 0\}$

$t = 1$, $\{t_i\} = \{0, 0, 1, 1, 1, 0\}$

$c = 0.25$, $\{c_i\} = \{0, 0, 1/3, 1/6, 1, 0\}$

$\{n_d\} = \{1, 3, 1, 1\}$, $\{p_d\} = \{0, 3, 3, 6\}$

$\{t_d\} = \{0, 1, 1, 1\}$, $\{c_d\} = \{0, 1/3, 1/3, 1/6\}$

Fig. 2.1: Example graph with various quantities highlighted.



The subscript of 3 for the closed wedges indicates that each triangle creates 3 wedges in
$W_3$. Let $t = \frac{1}{3}\sum_i t_i = \frac{1}{3}|W_3|$ denote the total number of triangles (since each triangle
is counted thrice). The (global) clustering coefficient (also known as the transitivity)
[34] of an undirected graph is given by

$$c = \frac{|W_3|}{|W|} = \frac{\sum_i t_i}{\sum_i p_i} = \frac{3t}{p} = \frac{3 \times \text{total number of triangles}}{\text{total number of wedges}}. \qquad (2.1)$$


At the global level, c is an indicator of how tightly the nodes of the graph are connected 
and any community structure. 

2.2. Binned Degree-wise Clustering Coefficient. In this paper, we will
be using the binned degree-wise clustering coefficients, which measure how tightly
the neighborhood of vertices of a specified degree group are connected. Let $D \subseteq
\{1, 2, \dots\}$ be a subset of degrees (we ignore degree-zero nodes). We define $V_D =
\{\, i \in V \mid d_i \in D \,\}$ and $n_D = |V_D|$. In many cases, we are interested in a single degree,
i.e., if $D = \{d\}$ then $V_d$ is the set of nodes of degree $d$ and $n_d$ is the number of nodes
of degree $d$.

We define $W_D$ to be the set of all wedges centered at a node in $V_D$ and $p_D$ to be
the total number of wedges centered at nodes in $V_D$, i.e., $p_D = |W_D|$. If $D = \{d\}$,
then $p_d = n_d \binom{d}{2}$. We partition the set $W_D$ into four disjoint subsets as follows:

$$W_{D,0} = \{\, w \in W_D \mid w \text{ open} \,\},$$

$$W_{D,q} = \{\, w \in W_D \mid w \text{ closed and has } q \text{ nodes in } V_D \,\} \quad \text{for } q = 1, 2, 3.$$

Define $p_{D,q} = |W_{D,q}|$ for $q = 0, 1, 2, 3$. Clearly, $p_D = \sum_q p_{D,q}$. Then we can define
the binned degree-wise clustering coefficient, $c_D$, as the fraction of closed wedges in
$W_D$; i.e.,

$$c_D = (p_{D,1} + p_{D,2} + p_{D,3}) / p_D. \qquad (2.2)$$

The formula for triangles is more complex and given by

$$t_D = p_{D,1} + \frac{1}{2} \cdot p_{D,2} + \frac{1}{3} \cdot p_{D,3},$$

since for each triangle there is either one wedge in $W_{D,1}$, two wedges in $W_{D,2}$, or three
wedges in $W_{D,3}$. Fig. 2.1 shows examples of these quantities when the bins are all
singletons: $\{1\}$, $\{2\}$, $\{3\}$, $\{4\}$.
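The definitions above can be checked by brute force on a small graph. The following is a minimal sketch, not part of the original method: the edge list `EDGES` is a hypothetical graph chosen only so that its degrees, wedge counts, and triangle counts agree with the values listed for Fig. 2.1 (the figure's actual drawing is not reproduced here), and the names `build_adjacency` and `degreewise_stats` are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical edge list (an assumption of this sketch): it was chosen so that
# the resulting degrees are {2, 2, 3, 4, 2, 1} with a single triangle (3, 4, 5).
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5), (4, 5), (4, 6)]


def build_adjacency(edges):
    """Return {vertex: set of neighbors} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degreewise_stats(edges):
    """Return {d: (n_d, p_d, t_d, c_d)} for singleton degree bins D = {d}."""
    adj = build_adjacency(edges)
    acc = {}
    for i, nbrs in adj.items():
        d = len(nbrs)
        n, p, t, closed = acc.get(d, (0, 0, Fraction(0), 0))
        n += 1
        for a, b in combinations(sorted(nbrs), 2):  # every wedge centered at i
            p += 1
            if b in adj[a]:  # the wedge is closed
                closed += 1
                # q = number of triangle vertices that fall in the bin {d}
                q = sum(1 for x in (i, a, b) if len(adj[x]) == d)
                t += Fraction(1, q)  # weight 1, 1/2, or 1/3 as in the t_D formula
        acc[d] = (n, p, t, closed)
    return {
        d: (n, p, t, Fraction(closed, p) if p else Fraction(0))
        for d, (n, p, t, closed) in acc.items()
    }


if __name__ == "__main__":
    for d, row in sorted(degreewise_stats(EDGES).items()):
        print(d, row)
```

Exact rational arithmetic (`Fraction`) is used so that the weights 1/2 and 1/3 do not introduce rounding error in such a small check.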

3. Wedge Sampling for Triadic Measures. For a more detailed exposition 
of wedge sampling and empirical tests of its behavior, we refer the reader to [43]. For
completeness, we review the concepts and calculations here. 



T. G. Kolda, A. Pinar, T. Plantenga, C. Seshadhri, and C. Task

[Figure 3.1 plot; horizontal axis: Error ($\varepsilon$), with ticks from 0.02 to 0.1.]

Fig. 3.1: The number of samples needed for different error rates and different levels of
confidence. A few data points at 99.9% confidence are highlighted. [43]



3.1. Hoeffding's Inequality. We begin by stating the classic Chernoff-Hoeffding
concentration inequality for the sum of independent random variables. It will be convenient
to use the additive tail bound given by Hoeffding.

Theorem 3.1 (Hoeffding [20]). Let $X_1, X_2, \dots, X_k$ be independent random variables
with $0 \leq X_i \leq 1$ for all $i = 1, \dots, k$. Define $\bar{X} = \frac{1}{k} \sum_{i=1}^{k} X_i$. Let $\mu = E[\bar{X}]$.
Then for any $\varepsilon > 0$, we have

$$\operatorname{Prob}\{ |\bar{X} - \mu| \geq \varepsilon \} \leq 2 \exp(-2 \varepsilon^2 k).$$

We use this more convenient corollary.

Corollary 3.2. Let $X_1, X_2, \dots, X_k$ be independent random variables with $0 \leq
X_i \leq 1$ for all $i = 1, \dots, k$. Define $\bar{X} = \frac{1}{k} \sum_{i=1}^{k} X_i$. Let $\mu = E[\bar{X}]$. For any positive
$\varepsilon, \delta$, setting $k = \lceil 0.5\, \varepsilon^{-2} \ln(2/\delta) \rceil$ yields

$$\operatorname{Prob}\{ |\bar{X} - \mu| \geq \varepsilon \} \leq \delta.$$



We say that $\varepsilon$ is the error and $(1 - \delta)$ is the confidence. Fig. 3.1 shows the
number of samples needed for different error rates. We show three different curves
for different confidence levels. Increasing the confidence has minimal impact on the
number of samples. The number of samples is fairly low for error rates of 0.1 or 0.01,
but it increases with the inverse square of the desired error. Nonetheless, we shall
see that the roughly 3.8 million samples required for an error rate of $\varepsilon = 0.001$ at 99.9%
confidence is tiny from a MapReduce perspective.
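The sample size in Cor. 3.2 is a one-line computation. The sketch below is only an illustration of that formula; the function name `num_samples` is hypothetical.

```python
import math


def num_samples(eps, delta):
    """Number of samples k = ceil(0.5 * eps^-2 * ln(2 / delta)) from Cor. 3.2."""
    if eps <= 0 or delta <= 0:
        raise ValueError("eps and delta must be positive")
    return math.ceil(0.5 * eps ** -2 * math.log(2.0 / delta))


if __name__ == "__main__":
    # Sample sizes at 99.9% confidence (delta = 0.001) for three error rates.
    for eps in (0.1, 0.01, 0.001):
        print(eps, num_samples(eps, 0.001))
```

Because $k$ depends on $\delta$ only through $\ln(2/\delta)$, tightening the confidence changes the result far less than tightening the error.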

3.2. Global Clustering Coefficient and Triangles. The clustering coefficient 
$c$ is the fraction of closed wedges. This can be interpreted as the probability that a
uniform random wedge is closed. Hence, we may apply Hoeffding's inequality to
obtain the following result. 

Theorem 3.3 (Clustering Coefficient [43]). For $\varepsilon, \delta > 0$, set $k = \lceil 0.5\, \varepsilon^{-2} \ln(2/\delta) \rceil$.
For $i = 1, \dots, k$, choose wedge $w_i$ uniformly at random (with replacement) from $W$
and let $X_i$ be defined as

$$X_i = \begin{cases} 1, & \text{if } w_i \text{ is closed}, \\ 0, & \text{otherwise}. \end{cases} \qquad (3.1)$$

Then,

$$\operatorname{Prob}\{ |\hat{c} - c| \geq \varepsilon \} \leq \delta \quad \text{for } \hat{c} = \frac{1}{k} \sum_{i=1}^{k} X_i.$$

Proof. Since $c$ is the proportion of wedges that are closed, $E[\bar{X}] = c$. The proof
follows directly from Cor. 3.2. □

From Thm. 3.3, we know exactly how many wedges we need to sample ($k$) to obtain
a desired error ($\varepsilon$) and confidence ($1 - \delta$). Given a set of $k$ independently
sampled wedges from $W$ (with replacement), the estimate is

$$\hat{c} = \frac{1}{k} \sum_{i=1}^{k} X_i = \frac{\#\text{ closed sampled wedges}}{\#\text{ sampled wedges}}.$$

From the clustering coefficient, we can infer the number of triangles, per the 
following corollary. 

Corollary 3.4 (Counting Triangles [43]). Let the conditions of Thm. 3.3 hold.
Then

$$\operatorname{Prob}\{ |\hat{t} - t| \geq \varepsilon \cdot p/3 \} \leq \delta \quad \text{for } \hat{t} = \hat{c} \cdot p/3.$$

Proof. Since $t = c \cdot p/3$ per (2.1), our result follows directly from Thm. 3.3. □

Note that the expected error in the number of triangles is as a proportion of $p/3$
(one-third the total number of wedges) rather than a proportion of $t$. Hence, if $t$ is
very small compared to $p$, the relative error $|\hat{t} - t|/t$ may still be large.

Choosing Uniform Random Wedges. There is a challenge here in terms of picking
random wedges. We do not want to form all wedges explicitly. Instead, we implicitly
generate random wedges. Observe that the number of wedges centered at vertex $i$ is
exactly $\binom{d_i}{2}$, and $p = \sum_i \binom{d_i}{2}$. That leads to the following procedure. First, choose a
random vertex $i$ with probability $\binom{d_i}{2}/p$. Second, choose two of vertex $i$'s neighbors
(without replacement) to form a random wedge. To set up this distribution, we need
to compute the degree distribution.
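The two-step procedure above can be sketched serially as follows. This is an illustrative in-memory version under simple assumptions (the graph fits in a Python dict of neighbor sets), not the MapReduce implementation described later; the name `estimate_clustering` and its arguments are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def estimate_clustering(adj, k, rng=None):
    """Estimate the global clustering coefficient from k sampled wedges.

    adj maps each vertex to the set of its neighbors (undirected graph).
    Returns (c_hat, p), where p is the exact total number of wedges.
    """
    rng = rng or random.Random()
    centers = [v for v, nbrs in adj.items() if len(nbrs) >= 2]
    if not centers:
        raise ValueError("graph has no wedges")
    # Cumulative wedge counts (d choose 2 per center) define the distribution.
    cum = list(accumulate(len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers))
    p = cum[-1]
    closed = 0
    for _ in range(k):
        # Step 1: choose a center v with probability (d_v choose 2) / p.
        v = centers[bisect_right(cum, rng.randrange(p))]
        # Step 2: choose two distinct neighbors of v, without replacement.
        a, b = rng.sample(sorted(adj[v]), 2)
        closed += b in adj[a]  # True counts as 1 when the wedge is closed
    return closed / k, p


if __name__ == "__main__":
    # Complete graph on 4 vertices: every wedge is closed.
    k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
    c_hat, p = estimate_clustering(k4, 1000, random.Random(0))
    print(c_hat, c_hat * p / 3)  # clustering estimate and triangle estimate
```

The triangle estimate in the last line follows Cor. 3.4, $\hat{t} = \hat{c} \cdot p/3$.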

3.3. Binned Degree-wise Clustering Coefficients and Triangles. The strat- 
egy for computing the clustering coefficient per degree (or degree range) is nearly 
identical to the above. 

Theorem 3.5 (Binned Degree-wise Clustering Coefficient [43]). For $\varepsilon, \delta > 0$, set
$k = \lceil 0.5\, \varepsilon^{-2} \ln(2/\delta) \rceil$. For $i = 1, \dots, k$, choose wedge $w_i$ uniformly at random (with
replacement) from $W_D$ and let $X_i$ be defined as in (3.1). Then

$$\operatorname{Prob}\{ |\hat{c}_D - c_D| \geq \varepsilon \} \leq \delta \quad \text{for } \hat{c}_D = \frac{1}{k} \sum_{i=1}^{k} X_i.$$

Proof. Observe that $c_D = E[\bar{X}]$ since it is the probability that a random wedge
in $W_D$ is closed. The proof follows immediately from Cor. 3.2. □

To select a random wedge, recall that $p_D = |W_D|$, so we want to choose vertex
$i \in V_D$ with probability $\binom{d_i}{2}/p_D$. If $D = \{d\}$ (a singleton), then all nodes in $V_D$ are
equally probable.
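The estimate for the number of triangles is slightly more complicated since each
closed wedge may have 1, 2, or 3 vertices in $V_D$.

Theorem 3.6 (Degree-wise Triangle Count [43]). Let the conditions of Thm. 3.5
hold. For each $w_i$, let $Y_i$ be defined as

$$Y_i = \begin{cases} 1, & \text{if } w_i \in W_{D,1}, \\ \frac{1}{2}, & \text{if } w_i \in W_{D,2}, \\ \frac{1}{3}, & \text{if } w_i \in W_{D,3}, \\ 0, & \text{if } w_i \in W_{D,0} \text{ (open)}. \end{cases}$$

Then

$$\operatorname{Prob}\{ |\hat{t}_D - t_D| \geq \varepsilon \cdot p_D \} \leq \delta \quad \text{for } \hat{t}_D = p_D \cdot \frac{1}{k} \sum_{i=1}^{k} Y_i.$$

Proof. We claim $E[Y] = t_D / p_D$. Suppose that $w$ is selected from $W_D$ uniformly at
random. Observe that

$$E[Y] = \operatorname{Prob}\{ w \in W_{D,1} \} + \frac{1}{2} \operatorname{Prob}\{ w \in W_{D,2} \} + \frac{1}{3} \operatorname{Prob}\{ w \in W_{D,3} \} = \frac{p_{D,1}}{p_D} + \frac{1}{2} \cdot \frac{p_{D,2}}{p_D} + \frac{1}{3} \cdot \frac{p_{D,3}}{p_D} = t_D / p_D,$$

per the formula for $t_D$ in §2.2. Hence, from Cor. 3.2 we have

$$\operatorname{Prob}\{ |\hat{t}_D / p_D - t_D / p_D| \geq \varepsilon \} \leq \delta,$$

and the theorem follows by multiplying the inequality by $p_D$. □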
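The weighted estimator of Thm. 3.6 differs from the global case only in the per-wedge weight. The following serial sketch is an illustration under the assumption that the graph is held in memory as a dict of neighbor sets; the function name `estimate_bin_triangles` and the argument `degrees_in_bin` (the set $D$) are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def estimate_bin_triangles(adj, degrees_in_bin, k, rng=None):
    """Estimate (c_D, t_D) for the degree bin D from k wedges sampled in W_D."""
    rng = rng or random.Random()

    def in_bin(v):
        return len(adj[v]) in degrees_in_bin

    centers = [v for v in adj if in_bin(v) and len(adj[v]) >= 2]
    if not centers:
        raise ValueError("no wedges are centered in this bin")
    cum = list(accumulate(len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers))
    p_bin = cum[-1]  # p_D, the number of wedges centered at vertices in V_D
    sum_x = 0
    sum_y = 0.0
    for _ in range(k):
        v = centers[bisect_right(cum, rng.randrange(p_bin))]
        a, b = rng.sample(sorted(adj[v]), 2)
        if b in adj[a]:
            # q counts triangle vertices in V_D; q >= 1 because v is in V_D.
            q = sum(1 for u in (v, a, b) if in_bin(u))
            sum_x += 1           # X_i = 1 for a closed wedge
            sum_y += 1.0 / q     # Y_i = 1, 1/2, or 1/3
    return sum_x / k, p_bin * sum_y / k


if __name__ == "__main__":
    # In the complete graph on 4 vertices every vertex has degree 3, so with
    # D = {3} each closed wedge has q = 3 and the estimate is p_D / 3.
    k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
    print(estimate_bin_triangles(k4, {3}, 300, random.Random(1)))
```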

3.4. Computing a Random Sample of the Triangles. In addition to know- 
ing the number of triangles in a graph, it may also be interesting to consider the 
properties of those triangles. For instance, Durak et al. [17] consider the differences 
in node degrees in a triangle. 

It turns out that the closed wedges discovered during the wedge sampling procedure
are triangles sampled uniformly (with replacement). Hence, we can study these
randomly sampled triangles to estimate the overall characteristics of triangles in the
graph.

Theorem 3.7. Let $W_S$ be a random sample of the wedges of a graph $G$, and let
$T_S \subseteq W_S$ be the triangles that are formed by the closed wedges in $W_S$. Then each triangle in
$T_S$ is a uniform random sample from the triangles of $G$.

Proof. The proof depends on observing that a triangle being chosen depends only 
on one of its 3 wedges being chosen. Since the wedge sample is uniformly random, 
each triangle is equally likely to be picked, and there is no dependency between any 
pair of triangles, which implies a uniform sample. □ 

3.5. Practical Performance of Wedge Sampling. Earlier work by a subset 
of the authors [43] provides a thorough study on how the techniques described above 
perform in practice. As expected, tremendous improvements are achieved in runtimes 
compared to full enumeration, especially for large graphs, since the number of samples 
is independent of graph size. Specifically, we see speed-ups of more than 1000X with
errors in the clustering coefficient of less than 0.002. Additionally, in comparison to
the Doulion method (an edge-sampling technique) we obtain speed-ups of 5X or more
while obtaining the same accuracy. The ability to adapt our wedge-sampling method 
to computing binned degree-wise clustering coefficients and triangle sampling are also 
benefits in comparison to edge-based sampling. 

Our goal in this work is to implement the wedge sampling approach within the 
MapReduce framework and provide evidence that it can scale to much larger problems. 

4. MapReduce Implementation. 

4.1. Overview. We now present a MapReduce algorithm for estimating the 
clustering coefficients and number of triangles in a graph. For details on MapRe- 
duce, we refer the reader to Lin and Dyer [30]; we have emulated their style in our 
algorithm presentations. We use the popular, open-source Hadoop implementation 
of MapReduce, and the Hadoop Distributed File System (HDFS) for storing data. 
Each MapReduce job takes one or more distributed files as input. These files are 
automatically stored as splits (also known sometimes as blocks), and one mapper is 
launched per split. The mappers produce key-value pairs. All values with the same 
key are sent to the same reducer. The number of reducers is specified by the user. 
Each MapReduce job produces a single HDFS output file. A MapReduce job accepts 
configuration parameters, which are passed along as data to the mapper and reducer 
functions; we discuss these in more detail in the sections that follow. The set of 
MapReduce jobs in our algorithm is coordinated by a Hadoop Java program running 
on a single client node. 

In our code, we assume the nodes are binned by degree as discussed in §4.2. Computation
of the global clustering coefficient is a special case which can be computed by
either looking only at a single bin containing all degrees or using a weighted average 
of the binned clustering coefficients (see §4.6.2). 

Our input is an undirected edge list where the node identifiers are 64-bit integers; 
we assume no duplicates or self-edges and no particular ordering. We divide our 
MapReduce algorithm into three major phases plus post-processing, as presented in 
Fig. 4.1. Each major phase makes a complete pass through the edge list. The goal of 
the first phase is to set up the distribution on wedges. The goal of the second phase is 
to actually create the sample wedges. Finally, the goal of the third phase is to check 
whether or not the sample wedges are closed. In all three phases, we have strategies to
reduce the data volume in the shuffle phase (between the map and reduce), discussed
in detail in the sections that follow.

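4.2. Binning. We define degree bins in a parameterized way as follows. Let
$\tau$ be the number of singleton bins, and let $\omega > 1$ be the rate of growth on the bin
sizes. The first $\tau$ bins are singletons containing degrees $1, 2, \dots, \tau$ respectively. The
remaining bins grow exponentially in size.

We describe the lowest degree of bin $k$ as

$$d_{\min}(k) = \begin{cases} k, & \text{if } k \leq \tau, \\ \left\lceil \tau + \dfrac{\omega^{k - \tau} - 1}{\omega - 1} \right\rceil, & \text{otherwise}. \end{cases} \qquad (4.1)$$

The highest degree for bin $k$ is just one less than the lowest degree of bin $k + 1$. For
a given degree $d$, we can easily look up its bin as

$$\mathrm{BinId}(d) = \begin{cases} d, & \text{if } d \leq \tau, \\ \left\lfloor \log(1 + (\omega - 1)(d - \tau)) / \log(\omega) \right\rfloor + \tau, & \text{otherwise}. \end{cases} \qquad (4.2)$$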
[Figure 4.1 flowchart; the recoverable box labels are listed below.]

Phase 1a: Compute Degree Per Vertex
Phase 1b: Compute Number of Wedges Per Bin
Phase 2a: Select Sample Wedge Centers
Phase 2c: Create Sample Wedges
Phase 3b: Check Sample Wedge Closure
Phase 4a: Get Degree of Vertex per Wedge
Phase 4b: Get Degree of 2nd Vertex per Wedge
Phase 4c: Summarize Results

Fig. 4.1: Algorithm overview for estimating clustering coefficients and counting triangles,
both binned by degree. Red boxes indicate a MapReduce job, while green represents a serial
operation on the client node. Blue boxes indicate data files. The edge list is provided by the
user; all other data files are produced by the method. Orange boxes indicate data that is
passed as a "configuration parameter" to all mappers. Solid lines indicate consumption of
data. Dotted lines indicate creation of data. Note that there are three phases (1a, 2c, and
3b) that read all edges, which is the most expensive operation. Italicized text indicates an
optional (sub-)phase.



In our implementation, $\tau$ and $\omega$ are communicated to each MapReduce job as configuration
parameters.

For $\tau = 2$ and $\omega = 2$, the bins are $\{1\}$, $\{2\}$, $\{3, 4\}$, $\{5, 6, 7, 8\}$, $\{9, \dots, 16\}$,
$\{17, \dots, 32\}$, and so on. Note that the bin $\{1\}$ cannot have any wedges, so we just
ignore it. Let $\bar{d}$ be an upper bound on the highest degree for a given graph. Then
choosing $\tau = 1$ and $\omega = \bar{d}$ yields bins $\{1\}, \{2, \dots, \bar{d}\}$. In other words, we have a
single bin containing all vertices (excepting degree-1 vertices). On the other hand,
choosing $\tau = \bar{d}$ yields $\{1\}, \{2\}, \dots, \{\bar{d}\}$. Here, every bin is a singleton.

We are not constrained to equation (4.2) for computing the bins; we can use any 
procedure such that each degree is assigned to a single bin. Likewise, (4.1) is optional 
and used to reduce the shuffle volume in Phase 2c. 
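The bin lookup can be sketched in a few lines. The code below is an illustration only: `bin_id` follows the logarithmic lookup in (4.2), while `bin_lowest_degree` is obtained here by inverting that lookup, which is an assumption of this sketch rather than a quotation of (4.1). The correction loops in `bin_id` are also an addition, included to guard against floating-point rounding in the logarithm.

```python
import math


def bin_lowest_degree(k, tau, omega):
    """Lowest degree assigned to bin k (derived by inverting the lookup)."""
    if k <= tau:
        return k
    return math.ceil(tau + (omega ** (k - tau) - 1) / (omega - 1))


def bin_id(d, tau, omega):
    """Bin index for degree d, with tau singleton bins and growth rate omega > 1."""
    if d <= tau:
        return d
    k = math.floor(math.log(1 + (omega - 1) * (d - tau)) / math.log(omega)) + tau
    # Guard against rounding error in the logarithm (an addition in this sketch).
    while bin_lowest_degree(k + 1, tau, omega) <= d:
        k += 1
    while bin_lowest_degree(k, tau, omega) > d:
        k -= 1
    return k


if __name__ == "__main__":
    # With tau = 2 and omega = 2 the bins are {1}, {2}, {3,4}, {5..8}, {9..16}, ...
    print([bin_id(d, 2, 2) for d in range(1, 18)])
```

Any other assignment of degrees to bins would work equally well, provided each degree maps to exactly one bin.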

4.3. Phase 1: Compute Degree-based Statistics.

4.3.1. Phase 1a: Compute Degree per Vertex. Phase 1a is a straightforward
MapReduce task: computing the degree of each vertex. The Map and Reduce
functions are described in Alg. 1. The input is the edge list file; each entry is a pair
of vertex IDs $(v, w)$ that define an edge. The Map function is called for each edge
$(v, w)$ and emits two key-value pairs keyed to the vertex IDs and having a value of
1. The Reduce function gathers all the values for each vertex and sums them to
determine the degree. The final output to HDFS is a vertex degree file; each entry
is of the form $(v, d)$ where $v$ is a vertex ID and $d$ is its degree.

Algorithm 1 Compute Degree per Vertex (Phase 1a)

    method Map(v, w)                          ▷ Input is edge list file
        Emit(v, 1)
        Emit(w, 1)                            ▷ Emit both for an undirected graph

    method Reduce(v, {x_1, x_2, ...})
        d ← Sum({x_1, x_2, ...})              ▷ Compute degree
        Emit(v, d)                            ▷ Output is vertex degree file

Alg. 1 shows a simple version of the code. To make the code more efficient, we
collect local counts within each mapper (using a Java Map container) and emit the
totals. This technique is called an in-memory combiner [31]. We found the in-memory
combiner to reduce shuffle volume more than employing the reducer as a combiner.

Algorithm 2 Compute Number of Wedges per Bin (Phase 1b)

    parameters: τ, ω                          ▷ Binning parameters

    method Map(v, d)                          ▷ Input is vertex degree file
        b ← BinId(d, τ, ω)                    ▷ Compute bin ID
        n ← 1                                 ▷ Number of vertices
        p ← d · (d - 1)/2                     ▷ Number of wedges
        Emit(b, (n, p))

    method Reduce(b, {(n_1, p_1), (n_2, p_2), ...})   ▷ One reduce function per bin
        n ← Sum({n_1, n_2, ...})              ▷ Number of vertices in bin
        p ← Sum({p_1, p_2, ...})              ▷ Number of wedges in bin
        Emit(b, n, p)                         ▷ Output is wedges per bin file

4.3.2. Phase 1b: Compute Number of Wedges per Bin. Phase 1b works
with the output of Phase 1a (vertex degree file) to compute the number of wedges
per bin. The Map and Reduce functions for Phase 1b are presented in Alg. 2. The
input is the list of degrees per vertex. The Map function is called for each vertex
(with its associated degree) and emits the number of wedges for that vertex, keyed
to the appropriate bin. The Reduce function simply combines the results for each
bin. The final output is a wedges per bin file; each entry is of the form $(b, n_b, p_b)$
where $b$ is the bin ID, $n_b$ is the number of vertices in the bin, and $p_b$ is the number of
wedges in the bin.

Once again, we have shown a simple version of the algorithm in Alg. 2. To make 
the code more efficient, we collect local counts within each mapper (using a Java Map 
container) and emit the totals. 
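To make the data flow of Phases 1a and 1b concrete, the following single-process sketch mimics the map, shuffle, and reduce steps, including the local counting inside each mapper. It is a toy stand-in and not the Hadoop Java implementation: `run_job`, `map_degrees`, `reduce_degrees`, `make_map_wedges`, and `reduce_wedges` are hypothetical names, the splits are plain Python lists, and the edge list is a hypothetical six-vertex graph.

```python
from collections import Counter, defaultdict


def run_job(splits, mapper, reducer):
    """Toy MapReduce job: one mapper call per split, then shuffle, then reduce."""
    groups = defaultdict(list)
    for split in splits:
        for key, value in mapper(split):       # map
            groups[key].append(value)          # shuffle: group values by key
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))  # reduce, one call per key
    return output


def map_degrees(edges):
    """Phase 1a map with local counting: emit one partial degree per vertex."""
    local = Counter()
    for v, w in edges:
        local[v] += 1
        local[w] += 1
    yield from local.items()


def reduce_degrees(v, partial_counts):
    yield (v, sum(partial_counts))


def make_map_wedges(bin_id):
    """Phase 1b map, parameterized by a degree-to-bin function."""
    def map_wedges(vertex_degrees):
        local = defaultdict(lambda: [0, 0])    # bin -> [vertices, wedges]
        for _v, d in vertex_degrees:
            entry = local[bin_id(d)]
            entry[0] += 1
            entry[1] += d * (d - 1) // 2
        for b, (n, p) in local.items():
            yield b, (n, p)
    return map_wedges


def reduce_wedges(b, pairs):
    yield (b, sum(n for n, _ in pairs), sum(p for _, p in pairs))


if __name__ == "__main__":
    edges = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5), (4, 5), (4, 6)]
    degrees = run_job([edges[:4], edges[4:]], map_degrees, reduce_degrees)
    wedges = run_job([degrees], make_map_wedges(lambda d: d), reduce_wedges)
    print(degrees)
    print(wedges)
```

With local counting, each mapper emits at most one pair per distinct key it sees, rather than one pair per input record, which is what reduces the shuffle volume.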

For the case of a single bin, strictly speaking, Phase 1b is unnecessary. Instead,
we could have used a Hadoop global counter to tally the total wedges in the reduce
step of Phase 1a.
Algorithm 3 Determine number of samples per vertex (Phase 2a)

    parameter: k                              ▷ Desired number of samples per bin
    parameter: θ                              ▷ Represents wedges per bin object

    method Map(v, d)                          ▷ Input is vertex degree file
        b ← BinId(d)                          ▷ Compute bin ID
        p ← θ(b)                              ▷ Total number of wedges in bin containing v
        q* ← (d · (d - 1)/2) · k/p            ▷ Ideal number of samples, likely noninteger
        x ← Rand([0, 1])                      ▷ Uniform random number in [0, 1]
        q ← {x > (q* - ⌊q*⌋)} ? ⌊q*⌋ : ⌈q*⌉   ▷ Number of sample wedges centered at v
        if q ≥ 1 then                         ▷ Skip vertices with no samples
            Emit(v, d, q, p)                  ▷ Output is wedge centers file
        end if



4.3.3. Phase 1c: Gather Wedges per Bin. From the wedges per bin file
(output of Phase 1b), we create a wedges per bin object, which acts as a function θ
such that θ(b) is the number of wedges in bin b. The work is performed entirely in our
main program running on the client node. It reads Phase 1b output from HDFS, stores
wedges per bin values in a Java Map container, and launches the next MapReduce job
(Phase 2a), sending the serialized container as a configuration parameter.

4.4. Phase 2: Select Wedge Samples. 

4.4.1. Phase 2a: Select Sample Wedge Centers. The input to Phase 2a is 
the vertex degree file along with the wedges per bin object, which is passed as a 
configuration parameter. Phase 2a calculates the number of sample wedges centered 
at each vertex. The Map function is shown in Alg. 3. The Map function is called 
for each (vertex ID, degree) pair. From this, we can calculate the expected number 
of wedges that would be sampled from the vertex for a uniform random sample, q* . 
This number is unlikely to be integral. Rounding up would produce far too many 
wedges. Instead, we use probabilistic rounding. For instance, if q* = 0.1, then there
is a 10% chance of producing q = 1 wedge and a 90% chance of producing no wedges,
q = 0. We are only off by at most one, so if q* = 1.1, then there is a 10% chance
of producing q = 2 wedges and a 90% chance of producing q = 1 wedge. Hence,
the expected number of wedges for this vertex is exactly q*. Only vertices with at
least one sample wedge are emitted. The final output is a wedge centers file; each 
entry is of the form (v,d,q,p) where v is the vertex ID, d is the vertex degree, q is 
the number of sample wedges centered at that vertex, and p is the total number of 
wedges in the bin containing v. The Reduce function is just the identity map and is 
not shown. 
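The probabilistic rounding step can be sketched in Python as follows. The function name `samples_for_vertex` and the example values of d, k, and p are illustrative assumptions; the rounding rule itself follows the description above.

```python
import math
import random

def samples_for_vertex(d, k, p, rng=random):
    """Probabilistically round q* = (d(d-1)/2) * k / p so that E[q] = q*."""
    q_star = (d * (d - 1) / 2) * k / p
    frac = q_star - math.floor(q_star)
    x = rng.random()  # uniform random number in [0, 1)
    # Round up with probability frac, round down otherwise.
    return math.floor(q_star) if x > frac else math.ceil(q_star)

# Example: a vertex with 10 wedges, k = 11 samples, p = 100 wedges in its bin,
# so q* = 1.1 and the result is 2 about 10% of the time, 1 otherwise.
rng = random.Random(0)
draws = [samples_for_vertex(5, 11, 100, rng) for _ in range(20000)]
mean_q = sum(draws) / len(draws)
```

Averaged over many draws, `mean_q` should be close to q* = 1.1, which is the property that keeps the overall sample uniform in expectation.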

4.4.2. Phase 2b: Gather Sample Wedge Centers. Phase 2b is an optional
step that generates a Java Map of wedge centers and their bin IDs based on the output
of Phase 2a (wedge centers file). We represent this object as a function γ such that

\[
\gamma(v) =
\begin{cases}
0 & \text{if Phase 2b is skipped,} \\
1 & \text{if } v \text{ is not a wedge center,} \\
b \ge 2 & \text{if } v \text{ is a wedge center, in which case } b \text{ is the bin ID.}
\end{cases}
\]

This wedge centers object has one value for every vertex appearing as a wedge
center. It is serialized and passed as a configuration parameter to Phase 2c, where it
is used to filter the edges that are emitted by the Map function.

Note that Hadoop imposes a limit on the size of the configuration parameters
(5MB by default). If the number of wedge centers is too large (a few hundred thousand
samples will exceed 5MB), then other options must be explored. One alternative is
to pass the container to the Map tasks using the Hadoop distributed cache; however,
we have not implemented this idea.

Phase 2b is optional, and can be skipped if there are too many wedge centers. 
We demonstrate the benefits of this step in §5. 

4.4.3. Phase 2c: Create Sample Wedges. In Phase 2c, the goal is to take 
each sample wedge center (from the wedge center file), collect its neighbors (from 
the edge list file), and create a set of sample wedges. We merge each vertex and 
its neighbors at the reduce phase. If it exists, the optional wedge center object 
is used to filter the edges that are shuffled, ignoring all edges that are not adjacent 
to a sampled wedge center. The algorithm is shown in Alg. 4. For clarity, we give a 
separate Map function for each input type. In the actual implementation, we have 
to determine the input type on the fly, because both input files are of Hadoop type 
Text. For input from the wedge centers file, the Map function simply passes along 
its degree and sample wedge count (i.e., the number of wedges to be sampled from 
the vertex). 

For input from the edge list file, the Map function checks to see if the edge is 
adjacent to a wedge center. If so, it is passed based on the outcome of a random coin 
flip. The aim of the Reduce phase is to generate random wedges centered at a vertex 
(say v). The most naive Map implementation would forward all edges incident to v, 
so that wedges can be selected from them. A major problem with this is that if the
number of samples k is much less than the degree of v, most of the communication is
unnecessary. For example, the highest degree vertices of a social network graph might 
link to millions of edges, but k is in the tens of thousands or less; therefore, most of 
the incident edges will not participate in sampled wedges centered at these vertices. 
We have a probabilistic fix to address this situation. 

We do not have the vertex degree readily available, but we do know its bin and
therefore a lower bound on its degree. Consider a vertex v of degree d_v at least
d_min, where 2k < d_min/2. We send just some of the edges incident to v, each with
independent probability φ = 4k/d_min < 1. Then the expected number of edges to send
is 4k(d_v/d_min). Note that this expectation is at least 4k. Getting fewer than 2k edges
is potentially disastrous, but the probability of this is minuscule. By a multiplicative
Chernoff bound (given below), the probability of such an event is at most exp(−k/8). For
k = 1000 (a tiny sample size), the probability is less than 10^{-54}.

Theorem 4.1 (Multiplicative Chernoff Bound [16]). Let X = Σ_{i≤r} X_i, where
each X_i is independently distributed in [0, 1]. Then

\[
\mathrm{Prob}\{X < (1 - \delta)\,\mathrm{E}[X]\} < \exp\!\left(-\delta^{2}\,\mathrm{E}[X]/2\right).
\]

If d_v is not too far from d_min, then the expectation 4k(d_v/d_min) is potentially much
smaller than d_v. Hence, we get the desired number of random edges without sending
too many.
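The arithmetic behind this downselection can be checked with a short Python sketch. The function names are hypothetical, and capping the probability at 1 is an added assumption for bins whose lower-bound degree is small; the quantities 4k/d_min and exp(−k/8) are the ones quoted in the text.

```python
import math

def emit_probability(k, d_min):
    """phi = 4k/d_min; the cap at 1 is an assumption for small d_min."""
    return min(1.0, 4.0 * k / d_min)

def expected_edges_sent(k, d_v, d_min):
    """Expected number of incident edges forwarded for a vertex of degree d_v."""
    return emit_probability(k, d_min) * d_v

def shortfall_bound(k):
    """Bound exp(-k/8) quoted in the text on receiving fewer than 2k edges."""
    return math.exp(-k / 8.0)

# Example: k = 1000 samples, bin lower bound d_min = 8000, vertex degree 16000.
phi = emit_probability(1000, 8000)
sent = expected_edges_sent(1000, 16000, 8000)
bound = shortfall_bound(1000)
```

With these example values only half of the 16,000 incident edges are expected to be shuffled, while the bound on a shortfall is exp(−125), roughly 5 × 10^{-55}.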

Even with this improvement, the data passed forward may be too large to fit into 
the reducer's memory. We use a feature of Hadoop called secondary sort to ensure 
that the data arrives pre-sorted. Note that the key used for passing along the vertex 
information is v:0 and the key for the edges is of the form v:y, where y is a random
positive integer. This data is all mapped to the key v, but the values following the
colon control the sort of the values associated with v. The secondary key of zero
ensures that the degree and wedge count data are first. The secondary keys for edges
(y) ensure that the adjacent edges are randomly sorted; otherwise, Hadoop would 
present the edges in their order of arrival, which could bias the selection. 
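The effect of the composite keys can be simulated outside Hadoop with a few lines of Python. This is only a sketch of the ordering behavior, not of Hadoop's secondary sort API; the record layout and the helper name `shuffle_and_sort` are assumptions.

```python
import random

def shuffle_and_sort(records):
    """Group by primary key v; order values by secondary key (0 first, then random y)."""
    grouped = {}
    for (v, _y), value in sorted(records, key=lambda r: r[0]):
        grouped.setdefault(v, []).append(value)
    return grouped

rng = random.Random(1)
# Edges incident to vertex 7, each keyed as v:y with a random positive y.
records = [((7, rng.randint(1, 2**63 - 1)), ("edge", w)) for w in (3, 9, 12)]
# Wedge center record (d, q, p) keyed as v:0, appended last on purpose.
records.append(((7, 0), ("center", (3, 1, 3))))
values = shuffle_and_sort(records)[7]
```

Even though the wedge center record was added last, it arrives first in `values`, followed by the edges in an order determined by their random secondary keys.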

From the secondary sort, the wedge center must be first in the values list at the 
reduce phase, if it exists. If it does not exist, then there is nothing to do. Recall 
that for each wedge center v, we have its degree, d, and a desired number of wedge 
samples, q. Each wedge must be randomly sampled with replacement. The two edges 
of a single wedge are sampled without replacement. So, wedge sampling requires a 
minimum of 2 and a maximum of 2q edges. If 2q > d, some edges are necessarily 
reused. On the other hand, if 2q ≪ d, then we expect every wedge centered at v
to have two unique edges. For large d, we want to avoid reading all neighbors since
the list is quite long. We do this by using a simulated sampling procedure explained
below. It only requires us to read the first d' neighbors, where d' ≤ min{d, 2q}.

We need to produce q uniform random wedges (with replacement) centered at
v. This is done through the procedure Sampling. Imagine the edges incident to
v numbered arbitrarily from 1 to d. A uniform random wedge is represented as a
uniform random pair of indices (i, j) with i ≠ j, i ≤ d, and j ≤ d. We can repeat this q times
to implicitly sample q random wedges, each of which is just represented as a pair of
indices. Observe that the total number of indices in the union of these pairs is at
most d', so all we need are the d' uniform random edges obtained as the output of the
Map phase. We map these indices randomly to the index set {1, 2, ..., d'} through a
permutation. Now, each wedge is indexed as a pair (i, j) with i ≠ j, i ≤ d', and j ≤ d'. From
the list of edges/neighbors {x_1, x_2, ..., x_{d'}}, we can generate these random edges.
This is what is done in Sampling and Reduce in Alg. 4.
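A minimal Python sketch of the simulated sampling procedure is given below. It assumes d ≥ 2 and uses a random bijection for the renumbering; the function name `sampling` mirrors the pseudocode, but the details are an illustration rather than the authors' Java code.

```python
import random

def sampling(d, q, rng=random):
    """Pick q index pairs from {1..d} (requires d >= 2), then renumber to {1..d'}."""
    pairs = []
    for _ in range(q):
        i = rng.randint(1, d)
        j = rng.randint(1, d - 1)
        if j >= i:
            j += 1  # shift past i so that j is uniform over {1..d} \ {i}
        pairs.append((i, j))
    used = sorted({idx for pair in pairs for idx in pair})  # unique indices S
    d_prime = len(used)                                     # number of edges needed
    targets = list(range(1, d_prime + 1))
    rng.shuffle(targets)                                    # random bijection S -> {1..d'}
    pi = dict(zip(used, targets))
    return d_prime, [(pi[i], pi[j]) for i, j in pairs]

rng = random.Random(42)
d_prime, pairs = sampling(1000, 5, rng)        # high-degree vertex, few samples
d_small, pairs_small = sampling(3, 50, rng)    # low-degree vertex, many samples
```

The reducer then needs to read only the first `d_prime` neighbors and can look up each wedge's endpoints by the renumbered indices.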

The final output of this phase is a sample wedge list file, where each entry is
of the form (h, v_0, v_1, v_2, p, d_0). The number h is a hash of the desired closure edge
(v_1, v_2) that is used for load balancing, the wedge is defined by (v_1, v_0, v_2), p is the
total number of wedges in the bin containing v_0, and d_0 is the degree of vertex v_0.

As mentioned above, Phase 2b is optional. If skipped, then we let the MapReduce
shuffle bring adjacent edges of a wedge center together in the reduce phase. We defer 
calculation of the number of sample wedges to the reduce phase, but otherwise proceed 
as defined above. Note that in many cases the reducer collects zero samples and does 
no work. 

4.5. Phase 3: Check Sample Wedge Closure. 

4.5.1. Phase 3a: Gather Sample Wedge Closure Hashes. Phase 3a (optional)
assembles a list of all the unique edge hashes from the sample wedges
file and stores it as a Java Set object. We denote this wedge hashes object by
ζ = {h_1, h_2, ...}. This is very similar to the procedure in Phase 2b, which assembles
the list of wedge centers. We set ζ = ∅ if Phase 3a is skipped.

4.5.2. Phase 3b: Check Sample Wedge Closure. Phase 3b is the last major
step and checks the wedge closures, as shown in Alg. 5. The inputs are the sample
wedges file created by Phase 2c and the original edge list file. We also pass the
optional wedge hashes object (ζ) as a configuration parameter. If ζ is nonempty, it
is used to filter the edges passed forward to the reduce function. (Note that we could
skip Phase 3a and forward every edge to the reducers, but this would result
in much greater data shuffling in Phase 3b.) Note that more than one edge may hash
to the same value; hence, we loop through all edges that arrive at the reducer to verify
that there is a match before declaring a wedge as closed. Likewise, more than one
wedge may be closed by a single edge. The output of this phase is the results file
(ver. 0); each entry is of the form (σ, v_0, v_1, v_2, p, d_0) where σ indicates if the wedge
is open or closed and everything else is the same as for the sample wedges file.

Algorithm 4 Create Sample Wedges (Phase 2c)

parameters: τ, ω                              ▷ Binning parameters
parameter: k                                  ▷ Desired number of samples per bin
parameter: γ                                  ▷ Represents wedge centers object

method Map(v, d, q, p)                        ▷ Input is wedge centers file
    Emit(v:0, (d, q, p))                      ▷ Note secondary sort key

method Map(v, w)                              ▷ Input is edge list file
    EdgeHelper(v, w)
    EdgeHelper(w, v)

method EdgeHelper(v, w)
    b ← γ(v)                                  ▷ Extract bin ID
    if b = 0 then                             ▷ Phase 2b was skipped
        φ ← 1                                 ▷ Always emit the edge
    else if b = 1 then                        ▷ Vertex is not a wedge center
        φ ← 0                                 ▷ Never emit the edge
    else                                      ▷ Vertex is a wedge center
        d_min ← BinLoDeg(b, τ, ω)             ▷ Lower bound degree of v
        φ ← 2 · (2k/d_min)                    ▷ Proportion of edges to emit for v
    end if
    x ← Rand([0, 1])                          ▷ Uniform random number in [0, 1]
    if x < φ then                             ▷ Probabilistically downselect
        y ← Rand({1, ..., maxlongint})        ▷ Random long integer
        Emit(v:y, w)                          ▷ Note secondary sort key
    end if

method Reduce(v, {x_1, x_2, ...})
    if x_1 is a wedge center then             ▷ If it exists, the wedge center information is first
        (d, q, p) ← x_1                       ▷ Unpack wedge center information
        (d', {(i_ℓ, j_ℓ)}, ℓ = 1, ..., q) ← Sampling(d, q)    ▷ Determine sample wedges
        {w_1, ..., w_{d'}} ← {x_2, x_3, ..., x_{d'+1}}        ▷ Read only d' neighbors
        for each ℓ = 1, ..., q do
            h ← Hash(w_{i_ℓ}, w_{j_ℓ})        ▷ Hash of edge that would close this wedge
            Emit(h, v, w_{i_ℓ}, w_{j_ℓ}, p, d)    ▷ Output is sample wedges file
        end for
    end if

method Sampling(d, q)                         ▷ Subroutine for simulated sampling
    for each ℓ = 1, ..., q do                 ▷ Generate endpoints for each wedge
        i_ℓ ← Rand({1, ..., d})
        j_ℓ ← Rand({1, ..., d} \ {i_ℓ})
    end for
    S ← {i_1, ..., i_q} ∪ {j_1, ..., j_q}     ▷ Gather unique indices (duplicates removed)
    d' ← |S|                                  ▷ Number of edges needed
    Define mapping π : S → {1, ..., d'}       ▷ Renumber from 1 to d'
    return d' and pairs {(π(i_ℓ), π(j_ℓ))}, ℓ = 1, ..., q

Algorithm 5 Check Sample Wedge Closure (Phase 3b)

parameter: ζ = {h_1, h_2, ...}                ▷ Represents wedge hashes object

method Map(h, v_0, v_1, v_2, p, d_0)          ▷ Input is sample wedges file
    Emit(h, (v_0, v_1, v_2, p, d_0))

method Map((w_1, w_2))                        ▷ Input is edge list file
    h ← Hash(w_1, w_2)                        ▷ Hash of edge
    if (ζ = ∅) or (h ∈ ζ) then
        Emit(h, (w_1, w_2))
    end if

method Reduce(h, {x_1, x_2, ...})
    Sort the values {x_1, x_2, ...} into E (edges) and W (wedges)
    for each w ∈ W do
        (v_0, v_1, v_2, p, d_0) ← w           ▷ Unpack wedge data
        σ ← "open"                            ▷ By default, wedges are open
        for each e ∈ E do
            (w_1, w_2) ← e                    ▷ Unpack edge data
            if (w_1 = v_1 and w_2 = v_2) or (w_2 = v_1 and w_1 = v_2) then
                σ ← "closed"
            end if
        end for
        Emit(σ, v_0, v_1, v_2, p, d_0)        ▷ Output is results file (ver. 0)
    end for
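The reducer logic of Phase 3b can be sketched in Python as follows. The tagged-tuple record layout ("edge" or "wedge" as the first element) and the function name `check_closure` are assumptions for illustration; the explicit comparison against every edge reflects the need to guard against hash collisions described above.

```python
def check_closure(values):
    """Values sharing one hash key: tagged ('edge', (w1, w2)) or ('wedge', (v0, v1, v2, p, d0))."""
    edges = [payload for kind, payload in values if kind == "edge"]
    wedges = [payload for kind, payload in values if kind == "wedge"]
    results = []
    for v0, v1, v2, p, d0 in wedges:
        state = "open"  # by default, wedges are open
        for w1, w2 in edges:
            # Compare actual endpoints, since different edges may share a hash.
            if (w1, w2) == (v1, v2) or (w2, w1) == (v1, v2):
                state = "closed"
        results.append((state, v0, v1, v2, p, d0))
    return results

# Edge (5, 6) is a hypothetical hash collision that closes neither wedge.
values = [
    ("edge", (2, 3)),
    ("wedge", (1, 3, 2, 10, 4)),
    ("edge", (5, 6)),
    ("wedge", (4, 5, 7, 10, 2)),
]
results = check_closure(values)
```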

4.6. Phase 4: Post-processing. 

4.6.1. Phases 4a & 4b: Find degrees of wedge endpoints. Phases 4a & 4b
augment each sample wedge with the degrees of v_1 and v_2. This information is needed
for estimating the number of triangles per bin. If only the clustering coefficients are
required, these two steps can be omitted. Alg. 6 shows Phase 4a; the procedure for
Phase 4b is analogous and so is omitted. The final output of Phase 4b is the results
file (ver. 2); each line is of the form (σ, v_0, v_1, v_2, p, d_0, d_1, d_2) where d_1 and d_2 are
the degrees of vertices v_1 and v_2, respectively, while the remainder is the same as for
the results file (ver. 0).

4.6.2. Phase 4c: Summarize Results. Phase 4c tallies the final results per
bin, using the logic in Alg. 7. Its output is the summary file. Each line is of the
form (b, q_0, q_1, q_2, q_3, c, p, t) where b is the bin ID, q_0 is the number of open wedges, q_i is
the number of closed wedges with i vertices in the bin, c is the clustering coefficient
estimate, p is the number of wedges in the bin, and t is the estimated number of
triangles with one or more vertices in the bin.
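The per-bin tally can be sketched in Python as below. The function name `summarize_bin` and the example values are assumptions; the formulas for c and t follow the reducer logic of Alg. 7, where a closed wedge with i of its vertices in the bin contributes weight 1/i to the triangle estimate.

```python
def summarize_bin(b, values):
    """values: iterable of (state, p, b1, b2) for wedges centered in bin b."""
    q = [0, 0, 0, 0]  # q[0]: open wedges; q[i]: closed wedges with i vertices in bin b
    p = 0
    for state, p_bin, b1, b2 in values:
        p = p_bin  # total wedges in the bin, identical for every record
        if state == "open":
            q[0] += 1
        else:
            q[1 + (b == b1) + (b == b2)] += 1
    total = sum(q)
    c = (q[1] + q[2] + q[3]) / total                    # clustering coefficient estimate
    t = p * (q[1] + q[2] / 2 + q[3] / 3) / total        # triangle estimate for the bin
    return (b, q[0], q[1], q[2], q[3], c, p, t)

summary = summarize_bin(3, [
    ("open", 100, 3, 4),
    ("closed", 100, 4, 5),   # only the center is in bin 3
    ("closed", 100, 3, 5),   # center and one endpoint in bin 3
    ("closed", 100, 3, 3),   # all three vertices in bin 3
])
```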

Algorithm 6 Find Degree of First Vertex per Wedge (Phase 4a)

method Map(σ, v_0, v_1, v_2, p, d_0)          ▷ Input is results file (ver. 0)
    Emit(v_1:1, (σ, v_0, v_1, v_2, p, d_0))

method Map(v, d)                              ▷ Input is vertex degree file
    Emit(v:0, d)

method Reduce(v, {x_1, x_2, ...})             ▷ Add the degree of v_1 for each wedge
    d_1 ← x_1                                 ▷ First value is the degree of the vertex
    for each x ∈ {x_2, x_3, ...} do           ▷ Remaining values, if any, comprise sample wedges
        Emit(x, d_1)                          ▷ Output is results file (ver. 1)
    end for

Algorithm 7 Summarize Results (Phase 4c)

parameters: τ, ω                              ▷ Binning parameters

method Map(σ, v_0, v_1, v_2, p, d_0, d_1, d_2)    ▷ Input is results file (ver. 2)
    b_0 ← BinId(d_0, τ, ω)
    b_1 ← BinId(d_1, τ, ω)
    b_2 ← BinId(d_2, τ, ω)
    Emit(b_0, (σ, p, b_1, b_2))

method Reduce(b, {x_1, x_2, ...})
    q_0, q_1, q_2, q_3 ← 0
    for each x ∈ {x_1, x_2, ...} do
        (σ, p, b_1, b_2) ← x                  ▷ Unpack value
        if σ = "open" then
            q_0 ← q_0 + 1
        else
            i ← 1 + (b = b_1) + (b = b_2)
            q_i ← q_i + 1
        end if
    end for
    c ← (q_1 + q_2 + q_3)/(q_0 + q_1 + q_2 + q_3)
    t ← p · (q_1 + q_2/2 + q_3/3)/(q_0 + q_1 + q_2 + q_3)
    Emit(b, q_0, q_1, q_2, q_3, c, p, t)      ▷ Output is summary file

We can estimate the global clustering coefficient from the degree-binned clustering
coefficients as follows. Let c_b and p_b be the clustering coefficient estimate and total
number of wedges for bin b. Let p = Σ_b p_b be the total number of wedges. Then the
estimates for the global clustering coefficient and total number of triangles are given
by

\[
\hat{c} = \sum_{b} \frac{p_b}{p}\, c_b
\quad\text{and}\quad
\hat{t} = \hat{c} \cdot \frac{p}{3}. \tag{4.3}
\]

Let b_max denote the total number of bins. We assume that every bin has k samples,
producing an error bound of ε with confidence (1 − δ). Then we argue that |c − ĉ| < ε
with confidence (1 − b_max · δ).
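The combination in (4.3) and the resulting confidence level can be sketched in a few lines of Python. The function names and the two-bin example are hypothetical; the formulas are the wedge-weighted average and the union-bound confidence stated above.

```python
def global_estimates(bins):
    """bins: list of (c_b, p_b) pairs. Returns (c_hat, t_hat) following (4.3)."""
    bins = list(bins)
    p = sum(p_b for _c_b, p_b in bins)                  # total number of wedges
    c_hat = sum(c_b * p_b for c_b, p_b in bins) / p     # wedge-weighted average
    t_hat = c_hat * p / 3                               # each triangle closes 3 wedges
    return c_hat, t_hat

def union_confidence(b_max, delta):
    """Confidence 1 - b_max * delta when every bin individually has confidence 1 - delta."""
    return 1 - b_max * delta

# Hypothetical example with two bins.
c_hat, t_hat = global_estimates([(0.5, 100), (0.1, 300)])
conf = union_confidence(20, 0.001)
```

In this example the global estimate is dominated by the bin holding most of the wedges, which is why bins with many wedges and low clustering pull the global coefficient down.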

5. Experimental Results. 

5.1. Data Description. We obtained real-world graphs from the Laboratory for
Web Algorithms (http://law.di.unimi.it/datasets.php), which were compressed
using LLP and WebGraph [9, 7]. We selected ten larger graphs for which the complete
edge lists were available. We also consider three artificially-generated graphs according
to the Graph500 benchmark [55], which uses Stochastic Kronecker Graphs (SKG) [28]
for its graph generator with [0.57, 0.19; 0.19, 0.05] as the 2x2 generator matrix. We
have added noise with a parameter of 0.1, as proposed in [42, 43], to avoid oscillatory
degree distributions. These graphs are generated in MapReduce. All networks are
treated as undirected for our study; in other words, if x → y, y → x, or both, we say
that edge (x, y) exists. Briefly, the networks are described as follows.

• amazon-2008 [9, 7]: A graph describing similarity among books as reported 
by the Amazon store. 

• ljournal-2008 [12, 9, 7]: Nodes represent users on LiveJournal. Node x connects
to node y if x registered y as a friend.

• hollywood-2009, hollywood-2011 [9, 7]: This is a graph of actors. Two actors 
are joined by an edge whenever they appear in a movie together. 

• twitter-2010 [25, 9, 7]: Nodes are Twitter users, and node x links to node y 
if y follows X. 

• it-2004 [9, 7]: Links between web pages on the .it domain, provided by IIT. 

• uk-2006-05, uk-2006-06, uk-union-2006-06-2007-05 (shortened to uk-union) [8,
9, 7]: Links between web pages on the .uk domain. (We ignore the time
labeling on the links in the last graph.)

• sk-2005 [6, 9, 7]: Links between web pages on the .sk domain. 

• graph500-23, graph500-26, graph500-29 [55, 28, 42, 37]: Artificially generated 
graphs according to the Graph500 benchmark using the SKG method. The 
number (23, 26, 29) indicates the number of levels of recursion and the size 
of the graph. 

The properties of the networks are summarized in Tab. 5.1; specifically, we report the
number of vertices, the number of undirected edges, the total number of wedges, and
estimates for the total number of triangles and the global clustering coefficients,
calculated according to (4.3). (To the best of our knowledge, we are the only group that
has calculated the last three columns, so these numbers have not been independently
validated.)
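Because (4.3) ties the triangle estimate to the wedge count through t = c · p/3, the reported columns can be spot-checked against each other. The sketch below does this in Python for a few rows of Tab. 5.1; the helper name `gcc_from_counts` is an assumption, and small discrepancies are expected because the tabulated counts are rounded to millions.

```python
def gcc_from_counts(wedges_millions, triangles_millions):
    """Invert t = c * p / 3 to recover the global clustering coefficient."""
    return 3.0 * triangles_millions / wedges_millions

# (wedges in millions, triangles in millions, reported GCC) from Tab. 5.1
rows = {
    "ljournal-2008": (9_960, 408, 0.1228),
    "hollywood-2009": (47_645, 4_907, 0.3090),
    "it-2004": (16_163_308, 48_788, 0.0091),
    "uk-union": (203_567_548, 447_133, 0.0066),
}

differences = {
    name: abs(gcc_from_counts(w, t) - gcc)
    for name, (w, t, gcc) in rows.items()
}
```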



ID   Graph Name       Nodes        Edges        Wedges          Triangles    GCC
                      (millions)   (millions)   (millions)      (millions)
1    amazon-2008      1            4            51              4            0.2603
2    ljournal-2008    5            50           9,960           408          0.1228
3    hollywood-2009   1            56           47,645          4,907        0.3090
4    hollywood-2011   2            114          120,899         7,097        0.1761
5    graph500-23      5            128          567,218         3,673        0.0194
6    it-2004          41           1,027        16,163,308      48,788       0.0091
7    graph500-26      34           1,054        9,087,164       28,186       0.0093
8    twitter-2010     42           1,203        123,435,590     34,495       0.0008
9    uk-2006-06       80           2,251        16,802,569      186,453      0.0333
10   sk-2005          43           2,543        5,196,166,169   256,556      0.0001
11   uk-2006-05       77           2,636        167,591,218     363,111      0.0065
12   uk-union         132          4,663        203,567,548     447,133      0.0066
13   graph500-29      240          8,502        158,727,767     272,931      0.0052

Table 5.1: Network characteristics. All edges are treated as undirected. The triangle counts
and global clustering coefficients (GCC) are our estimates.

Fig. 5.1: Runtimes broken down by phases



5.2. Experimental Setup. We have conducted our experiments on a Hadoop
cluster with 32 compute nodes. Each compute node has an Intel i7 930 CPU at
2.8GHz (4 physical cores, HyperThreading enabled), 12 GB of memory, and four 2TB
SATA disks. All experiments were run using Hadoop Version 0.20.203.0. Unless
otherwise stated, all experiments use the following parameters:

• Number of samples per bin: k = 10,000

• Bin parameters: τ = 2 and ω = 2 (i.e., bins are {2}, {3, 4}, {5, 6, 7, 8}, ...)

• Number of reducers: 64



5.3. Experimental Results and Timings. We ran our MapReduce code on
the 13 networks described in §5.1. The runtimes are reported in Fig. 5.1, broken
down by the phases of the algorithm. The largest real-world graph, uk-union (#12)
with over 100 million vertices and over 4.6 billion edges, took less than 30 minutes
to analyze. For all the networks, the most expensive step is Phase 1a, calculating
the degree per node, because every edge in the edge list file generates two key-value
pairs. For uk-union, this step takes over 12 minutes. Phases 1b and 2a are
essentially constant time (approximately 30 seconds) because they process only the
vertex degree file. Phase 2c (Create Sample Wedges) is the next most expensive
step. Here we collect the edges adjacent to wedge centers, reading the entire edge list
file again in the map phase; however, the wedge centers object created in Phase 2b
minimizes the number of edges that are passed on to reducers. Nevertheless, for the
wedge centers that are high degree, many edges are transmitted (though substantially
fewer than the entire edge list). Phase 3b (Check Sample Wedge Closure) is the next
most expensive, again reading the entire edge list file. The last few postprocessing
steps are close to constant time (approximately 30-60 seconds each).

Fig. 5.2: Correlation of the runtime to the number of edges (x-axis: edges in millions).
There is a constant cost of 225 sec., which accounts for the MapReduce overhead, and
then an incremental cost of 0.33 sec. per million edges.



In an enumeration approach, we expect the runtimes to be proportional to the
number of wedges. In that case, sk-2005 (#10) would be the most expensive by
an order of magnitude. For our method, however, the runtime is proportional to the
number of edges. Fig. 5.2 shows that there is a near-linear relationship between the
number of edges and the runtime. The x-axis is the number of edges (in millions)
and the y-axis is the runtime. All 13 examples are included. We see that there is a
constant cost of 225 sec., which accounts for the MapReduce overhead, and then an
incremental cost of 0.33 sec. per million edges.
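The fitted linear model can be evaluated directly. The short Python sketch below applies the constants quoted above (225 sec. overhead plus 0.33 sec. per million edges) to two edge counts from Tab. 5.1; the function name is an assumption, and the outputs are model predictions rather than measured runtimes.

```python
OVERHEAD_SEC = 225.0            # constant MapReduce overhead from the fit
SEC_PER_MILLION_EDGES = 0.33    # incremental cost from the fit

def predicted_runtime_sec(edges_millions):
    """Runtime predicted by the linear fit for a graph with the given edge count."""
    return OVERHEAD_SEC + SEC_PER_MILLION_EDGES * edges_millions

uk_union_min = predicted_runtime_sec(4_663) / 60      # uk-union, 4,663M edges
graph500_29_min = predicted_runtime_sec(8_502) / 60   # graph500-29, 8,502M edges
```

For uk-union the model gives roughly 29.4 minutes, consistent with the statement that the graph took less than 30 minutes to analyze.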

Comparisons to Other Methods. As one point of comparison, we ran Plantenga's
implementation that fully enumerates triangles [38] on several smaller graphs. The
results are summarized in Tab. 5.2. This implementation is efficient because it only


Id   Graph Name       Wedges     Run Time   Exact    Sampling   Sampling
                      Checked    (sec.)     GCC      Error      Speed-up
1    amazon-2008      25%        158        0.2603   0.0001     1
2    ljournal-2008    19%        3,385      0.1228   0.0010     11
3    hollywood-2009   29%        21,665     0.3090   0.0006     72
4    hollywood-2011   27%        90,598     0.1761   0.0006     293

Table 5.2: Comparison of sampling and exact enumeration.



checks wedges where the center degree is smallest, i.e., (u, v, w) such that d_v ≤ d_u
and d_v ≤ d_w. This means that only one wedge per triangle is checked and moreover
that many open wedges centered at high-degree nodes are ignored. The table lists
the percentage of wedges that are checked for closure. The run times range from
3 min. for the smallest graph, up to 25 hrs. for hollywood-2011 (2M nodes, 114M
edges). Because this code enumerates every triangle, we can calculate the exact global
clustering coefficient (GCC). The error from our sampling method is also reported. At
a confidence level of 99.9%, using k = 2000 samples yields an error of ε = 0.05. The
true errors are one to two orders of magnitude less than this worst-case probabilistic
bound. Finally, we observe the main advantage of sampling in terms of the observed
speed-up, up to 293X for hollywood-2011. Larger graphs cannot be completed in a
reasonable amount of time on our cluster. As previously noted, Suri and Vassilvitskii
[44] have another enumeration method that was able to process the twitter-2010 data
set in 483 minutes on a 1636-node Hadoop cluster.

The other class of methods for comparison is edge-sampling methods
such as Doulion [48]. The basic idea is to sample a subset of edges and then run a
triangle counting method (such as enumeration) on the reduced graph. Edge sam- 
pling has been compared to wedge sampling in serial [43]. Keeping 1 in 25 edges 
produces results that are roughly comparable to wedge sampling in time but with 
much greater variance in the GCC estimate. Keeping fewer edges yields savings in 
time but at the expense of much greater variance. Hence, we have not compared 
parallel implementations. 

Impact of Implementation Features. We have considered many alternatives during 
the implementation of the wedge sampling algorithm, and in this subsection we present 
the impact of two implementation features. The three versions of the code we compare 
are: (1) Original algorithm. (2) Skip Phase 2b. (3) Skip Phase 3a. We show results 
for uk-union in Fig. 5.3. Skipping Phase 2b means that every edge generates two 
messages in Phase 2c, increasing the time in that phase from 372 sec. to 1235 sec. (3X 
increase), twice as expensive as Phase 1a. Skipping Phase 3a means that every edge
generates one message in Phase 3b, increasing the time in that phase from 207 sec. 
to 543 sec. (2.5X increase). Hence, taking measures to reduce the data that must be 
shuffled to the reducers has major pay-offs in terms of performance. 



Fig. 5.3: Timings for variations (x-axis: variation). The variations are (1) Original
algorithm, (2) Skip Phase 2b, (3) Skip Phase 3a.



5.4. Degree Distribution. The output of Phase 1b yields the (binned) degree
distribution. We show results for 12 networks in Fig. 5.4 (we omit uk-2006-05 because 
it is very similar to uk-2006-06). For each data point, the x-coordinate is the minimum 
degree of the bin, and the y-coordinate is the total number of vertices in that bin. 

The degree distributions all show something that can be roughly characterized as
heavy-tailed. None of the real-world graphs are particularly smooth in the degree
distribution, and some have odd spikes, especially in the tails (e.g., sk-2005, it-2004,
uk-union). The artificial graphs (graph500-23/26/29) are extremely smooth; we know
from analysis that the noisy version of Graph500 yields lognormal tails [42, 43].

Fig. 5.4: Degree distribution (x-axis: degree)

5.5. Clustering Coefficients. We compute the clustering coefficients for each
bin. These are displayed in Fig. 5.5. Here the x-coordinate is the minimum degree
in the bin, and the y-coordinate is the average clustering coefficient for wedges with
centers in that bin.

Social networks are well-known to have not only high clustering coefficients, but 
also clustering coefficients that tend to degrade as the degree increases. This can be 
seen in the following graphs: amazon-2008, ljournal-2008, hollywood-2009, hollywood- 
2011. The twitter-2010 graph is not as "social" in terms of clustering coefficient. 

The web graphs (it-2004, uk-2006-06, uk-union, sk-2005) are interesting because 
the clustering coefficients seem to start low, increase, and then drop off very quickly.
This may be due to the design of web sites with many interconnected pages or to 
some artifact of the crawling process. 

The Graph500 examples have overall low clustering coefficients and do not seem
to behave like the real-world graphs. The closest match is the twitter-2010 graph. 

5.6. Triangles. We also measure the number of triangles per bin in Fig. 5.6. 
Here the x-coordinate is the minimum degree in the bin, and the y-coordinate is the 
proportion of triangles that have at least one vertex in that bin. Triangles may be 
counted more than once if they have different vertices in different bins. 

It is interesting to observe where triangles come from. Even though low-degree 
nodes are the most plentiful, most of the triangles come from higher-degree nodes. We can
roughly sort the graphs into three categories. 

There is only one graph where the triangles come from relatively low-degree ver- 
tices: amazon-2008. Here it seems that most triangles come from nodes with degrees 
between 4 and 80. 

There are several graphs where the triangles come from the mid-range degrees: 
one social graph (ljournal-2008) and all the web graphs (it-2004, sk-2005, uk-2006-06,
uk-union). The "double-spike" behavior of sk-2005 is interesting. 

Finally, there are a few graphs where the vast majority of triangles involve the 
high-degree nodes. Both Hollywood graphs are of this type; note that 60% of the 
triangles involve one of the nodes in the bin starting at degree 1025. The Graph500 
graphs also have most of the triangles coming from the highest degree nodes. 

5.7. Triangle Statistics. An interesting feature of our wedge sampling tech- 
niques is that, in the case of a single bin, all the closed triangles are uniformly ran- 
domly sampled as well. Such a random sample can be used to analyze the character- 
istics of the triangles in the graph, going further than merely looking at their count. 
Examples of such studies can be found in [17], where full enumeration of the triangles 
was used. To avoid the burden of full enumeration a uniform sampling of the triangles 
can be used, as we showcase below. 

For four example graphs, we ran our MapReduce code with a single bin (τ = 1
and ω = 10^) and k = 5,000,000 samples; we skipped Phases 2b and 3a to avoid any
data overflow problems in the configuration parameters. Runtimes and the number
of triangles (expected to be roughly k times the global clustering coefficient) are
reported in Tab. 5.3.
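The expectation that the number of sampled triangles is roughly k times the global clustering coefficient can be checked against the reported values. The Python sketch below compares k · GCC, using the GCC estimates from Tab. 5.1, with the triangle counts reported in Tab. 5.3; the variable names are assumptions.

```python
K = 5_000_000  # wedge samples used in the single-bin runs

# (GCC estimate from Tab. 5.1, sampled triangles from Tab. 5.3)
rows = {
    "uk-union": (0.0066, 33_398),
    "hollywood-2011": (0.1761, 878_719),
    "graph500-26": (0.0093, 46_047),
    "graph500-29": (0.0052, 25_994),
}

relative_gap = {
    name: abs(K * gcc - observed) / observed
    for name, (gcc, observed) in rows.items()
}
```

For all four graphs the gap is below 2%, which is in line with the four-decimal rounding of the tabulated GCC values.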



Graph Name       Time (s)   Triangles
uk-union         2618       33398
hollywood-2011   348        878719
graph500-26      845        46047
graph500-29      5487       25994

Table 5.3: Number of triangles from 5,000,000 wedge samples.

Fig. 5.5: Clustering coefficients by bin (x-axis: degree)



Using these sampled triangles, we can look at the degrees of the vertices. Each
triangle has a minimum, middle, and maximum degree. We analyze the degree
assortativity of the vertices of the triangles by comparing the minimum and maximum
degrees in Fig. 5.7. Specifically, we assign each vertex to a degree bin, using (4.2) with
τ = 2 and ω = 2. We group all triangles with the same minimum degree bin together.

Fig. 5.6: Proportion of triangles in each bin (x-axis: degree)



The box plot shows the statistics of the bin for the maximum degree: the central 
mark (red) is the median max-degree, while the edges of the (blue) box are the 25th 
and 75th percentiles. The whiskers extend to the most extreme points considered not 
to be outliers, and the outliers (red plus marks) are plotted individually. 

Observe that the social network, hollywood-2011, shows an assortative relation
between the maximum and minimum degrees, since the two quantities
rise gradually together. For the uk-union web graph, on the other hand, the average
maximum degree is essentially invariant to the minimum degree. These findings are
consistent with the results in [17] about networks with high global clustering
coefficients having degree assortative triangles, while this assortativity cannot be observed
in networks with low clustering coefficients. Here, we were able to observe the same
trend on these massive graphs using sampling in a much more efficient way, avoiding
the enumeration burden. We see that the Graph500 networks also have almost no
assortativity between the minimum and maximum degrees and therefore do not have
the characteristics of a social network.



Fig. 5.7: Triangle Degree Assortativity for hollywood-2011 and uk-union 
(horizontal axis: degree bin of minimum triangle degree) 



6. Conclusions. We have shown that wedge-based sampling can be scaled to 
massive graphs in the MapReduce framework. On a relatively small MapReduce 
cluster (32 nodes), we have analyzed graphs with up to 240M nodes, 8.5B edges, 
5.2T wedges, and 447B triangles. Even the largest graph was analyzed in less than 
one hour, and most took only a few minutes. Fig. 6.1 shows a timing analysis of the 
MapReduce tasks [53] for Phase 1a on the uk-union graph. Mapper tasks run in waves 
of 128 parallel jobs, equal to the number of mapper slots available on the cluster. Note 
that a larger cluster would be able to run more Map tasks in parallel, decreasing the 
overall runtime. To the best of our knowledge, these are the largest triangle-based 
calculations performed to date. 

Fig. 6.1: Task breakdown for Phase 1a on uk-union on 32 Hadoop nodes 
(horizontal axis: time [hr:min:sec]). 

Unlike enumeration techniques that need to at least validate every triangle and 
more often have cost proportional to the number of wedges, our method is linear in 
the number of edges. The most expensive component of the wedge-based sampling is 
finding the degree of each vertex (Phase 1a); reducing the time for this is a topic for 
future study. On our cluster, the time is approximately 0.33 sec. per million edges, 
plus a fixed cost of 225 sec. for overhead. Because we are using MapReduce, we never 
need to fit the entire graph into memory; we only need to be able to stream through 
all the edges. 
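
The linear cost model above can be written as a one-line estimate. The sketch below is illustrative only: the function name is hypothetical, the two constants are the averages quoted for our cluster configuration, and actual runtimes on individual graphs will deviate from this fit.

```python
def estimated_runtime_sec(num_edges,
                          sec_per_million_edges=0.33,
                          overhead_sec=225.0):
    """Approximate total runtime: fixed overhead plus a per-edge cost."""
    return overhead_sec + sec_per_million_edges * (num_edges / 1e6)

# Edge counts quoted in the paper for the largest real and generated graphs.
for label, edges in [("4.7B edges", 4.7e9), ("8.5B edges", 8.5e9)]:
    seconds = estimated_runtime_sec(edges)
    print(f"{label}: about {seconds:.0f} sec ({seconds / 60:.1f} min)")
```

For 8.5B edges the model gives roughly 3030 sec, which is in line with the statement that even the largest graph was analyzed in less than one hour.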

Our MapReduce implementation requires a total of eight MapReduce jobs, three 
of which do most of the work because they read the entire edge list (Phases 1a, 2c, and 
3b) and two of which are optional (Phases 4a and 4b, which label the degrees 
of the sampled triangles). We have striven to minimize the data being shuffled in each 
phase by using special data structures to filter the edges. 
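
To make the streaming pattern concrete, the following Python sketch simulates a degree-counting job of the kind performed in Phase 1a. It is not our Hadoop implementation: the function names are hypothetical, the shuffle is simulated in memory purely for illustration, and the input is assumed to list each undirected edge exactly once with no self-loops.

```python
from collections import defaultdict

def map_edge(edge):
    """Mapper: each edge contributes one unit of degree to both endpoints."""
    u, v = edge
    yield u, 1
    yield v, 1

def reduce_degree(vertex, counts):
    """Reducer: the degree of a vertex is the sum of its contributions."""
    return vertex, sum(counts)

def degrees_from_edge_stream(edges):
    """Simulate map, shuffle, and reduce over a stream of edges."""
    shuffled = defaultdict(list)  # stands in for the framework's shuffle
    for edge in edges:            # edges are read one at a time
        for key, value in map_edge(edge):
            shuffled[key].append(value)
    return dict(reduce_degree(v, c) for v, c in shuffled.items())

# Hypothetical toy edge list: a triangle (1, 2, 3) plus a pendant vertex 4.
print(degrees_from_edge_stream([(1, 2), (2, 3), (1, 3), (3, 4)]))
```

In a real MapReduce job the shuffle is distributed across the cluster, so no single machine holds all the intermediate pairs; the mapper and reducer logic is the only part of this sketch that carries over.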

Using our code, we are able to compute the degree distribution and to approximate the 
binned degree-wise clustering coefficient and the number of triangles per bin. 
Additionally, we can analyze the characteristics of the triangles (e.g., degree assortativity). 
As part of our analysis, we have examined the graphs used in the Graph500 
supercomputer benchmark. We are able to give a more detailed understanding of the empirical 
properties of the generator and compare it to real-world data; this is potentially very 
helpful in determining if performance on the benchmark data is indicative of expected 
performance on real-world data. 